{
 "cells": [
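  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Common Concepts in NLP (Natural Language Processing)\n",
    "\n",
    "This notebook introduces basic concepts that come up often in the chatbot domain. Knowing these terms will help us follow the NLP training process when we train a chatbot later.\n",
    "\n",
    "Before training a chatbot model, we first need to know how to read model evaluation metrics such as the `F1-score` and the `Confusion matrix`."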
   ]
  },
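  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## F1-score in Machine Learning\n",
    "Before looking at the F1-score, we first define a few terms for a given class:<br>\n",
    "\n",
    "**TP (True Positive)**: a sample of this class that is correctly predicted as this class<br>\n",
    "**FP (False Positive)**: a sample of another class that is wrongly predicted as this class<br>\n",
    "**FN (False Negative)**: a sample of this class that is wrongly predicted as another class<br>\n",
    "\n",
    "The F1-score is an evaluation metric for classification problems. It lies between 0 and 1 and is the harmonic mean of precision and recall:\n",
    "\n",
    "$$ F_1 = 2 \\cdot \\frac{precision \\cdot recall}{precision + recall} $$\n",
    "\n",
    "Using the TP, FP, and FN counts defined above, compute the precision and recall for each class $k$.\n",
    "\n",
    "Precision: of the samples the classifier predicts as positive, the proportion that are actually positive.\n",
    "\n",
    "$$ precision_k = \\frac{TP_k}{TP_k + FP_k} $$\n",
    "\n",
    "Recall: of all samples that are actually positive, the proportion that are correctly predicted as positive.\n",
    "\n",
    "$$ recall_k = \\frac{TP_k}{TP_k + FN_k} $$\n",
    "\n",
    "The F1-score of each class is then computed as:\n",
    "\n",
    "$$ f1_k = \\frac{2 \\cdot precision_k \\cdot recall_k}{precision_k + recall_k} $$\n",
    "\n",
    "Averaging the per-class F1-scores from the previous step over all $n$ classes gives the final evaluation result:\n",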
    "\n",
    "$$ score = \\frac{1}{n}\\sum_{k=1}^{n} f1_k $$\n"
   ]
  },
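  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick check on these definitions, substituting the precision and recall formulas into the F1 formula gives an equivalent form written directly in terms of the counts. The counts in the numeric example ($TP_k = 8$, $FP_k = 2$, $FN_k = 4$) are hypothetical and chosen only for illustration; they do not come from any dataset in this notebook.\n",
    "\n",
    "$$ f1_k = \\frac{2 \\cdot \\frac{TP_k}{TP_k + FP_k} \\cdot \\frac{TP_k}{TP_k + FN_k}}{\\frac{TP_k}{TP_k + FP_k} + \\frac{TP_k}{TP_k + FN_k}} = \\frac{2 \\cdot TP_k}{2 \\cdot TP_k + FP_k + FN_k} $$\n",
    "\n",
    "$$ precision_k = \\frac{8}{8 + 2} = 0.8, \\quad recall_k = \\frac{8}{8 + 4} \\approx 0.667, \\quad f1_k = \\frac{2 \\cdot 8}{2 \\cdot 8 + 2 + 4} = \\frac{16}{22} \\approx 0.727 $$\n",
    "\n",
    "Note that TN (true negatives) never appears, so the F1-score ignores how many samples of other classes were correctly rejected."
   ]
  },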
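  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Confusion Matrix\n",
    "\n",
    "Each column of a confusion matrix represents a predicted class, and the column total is the number of samples predicted as that class. Each row represents the true class of the data, and the row total is the number of samples that actually belong to that class. Each value in a column is the number of samples from the corresponding true class that were predicted as that column's class.\n",
    "\n",
    "For example, suppose there are 150 samples, and 50 of them are predicted as each of Class 1, Class 2, and Class 3. After classification, the confusion matrix is:\n",
    "\n",
    "<table>\n",
    "  <tr>\n",
    "    <th colspan=\"2\" rowspan=\"2\">Confusion Matrix</th>\n",
    "    <th colspan=\"3\">Predicted</th>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td>Class 1</td>\n",
    "    <td>Class 2</td>\n",
    "    <td>Class 3</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td rowspan=\"3\">Actual</td>\n",
    "    <td>Class 1</td>\n",
    "    <td>43</td>\n",
    "    <td>2</td>\n",
    "    <td>0</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td>Class 2</td>\n",
    "    <td>5</td>\n",
    "    <td>45</td>\n",
    "    <td>1</td>\n",
    "  </tr>\n",
    "  <tr>\n",
    "    <td>Class 3</td>\n",
    "    <td>2</td>\n",
    "    <td>3</td>\n",
    "    <td>49</td>\n",
    "  </tr>\n",
    "</table>\n",
    "\n",
    "Each row sum is the number of samples that truly belong to that class (45, 51, and 54 here), and each column sum is the number of samples predicted as that class (50 each). Reading the first row, for instance: of the 45 samples that are actually Class 1, 43 are predicted correctly, 2 are predicted as Class 2, and none are predicted as Class 3."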
   ]
  },
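  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To connect this table with the F1-score section, the per-class metrics can be read off the matrix as a worked example. This is only a sketch, assuming the usual convention that for class $k$ the diagonal entry is $TP_k$, the rest of column $k$ is $FP_k$, and the rest of row $k$ is $FN_k$. All values are rounded to three decimals.\n",
    "\n",
    "Class 1: $TP_1 = 43$, $FP_1 = 5 + 2 = 7$, $FN_1 = 2 + 0 = 2$\n",
    "\n",
    "$$ precision_1 = \\frac{43}{43 + 7} = 0.86, \\quad recall_1 = \\frac{43}{43 + 2} \\approx 0.956, \\quad f1_1 = \\frac{2 \\cdot 0.86 \\cdot 0.956}{0.86 + 0.956} \\approx 0.905 $$\n",
    "\n",
    "Class 2: $TP_2 = 45$, $FP_2 = 2 + 3 = 5$, $FN_2 = 5 + 1 = 6$\n",
    "\n",
    "$$ precision_2 = \\frac{45}{45 + 5} = 0.90, \\quad recall_2 = \\frac{45}{45 + 6} \\approx 0.882, \\quad f1_2 = \\frac{2 \\cdot 0.90 \\cdot 0.882}{0.90 + 0.882} \\approx 0.891 $$\n",
    "\n",
    "Class 3: $TP_3 = 49$, $FP_3 = 0 + 1 = 1$, $FN_3 = 2 + 3 = 5$\n",
    "\n",
    "$$ precision_3 = \\frac{49}{49 + 1} = 0.98, \\quad recall_3 = \\frac{49}{49 + 5} \\approx 0.907, \\quad f1_3 = \\frac{2 \\cdot 0.98 \\cdot 0.907}{0.98 + 0.907} \\approx 0.942 $$\n",
    "\n",
    "The unweighted mean of the three per-class F1-scores is then:\n",
    "\n",
    "$$ \\frac{1}{3}(0.905 + 0.891 + 0.942) \\approx 0.913 $$"
   ]
  },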
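  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Corpus\n",
    "\n",
    "As the Chinese saying goes, even the cleverest cook cannot make a meal without rice, and the corpus is the \"rice\" of an NLP project. Here we use the Chinese Wikipedia dump and the Baidu Baike corpus listed in [awesome-chinese-nlp](https://github.com/crownpku/awesome-chinese-nlp). For how to process the Wikipedia dump, see [this post](http://blog.csdn.net/qq_32166627/article/details/68942216).\n",
    "\n",
    "We need a fairly large Chinese corpus. The best approach is to use a corpus that matches your own needs: for a finance chatbot, crawl more financial news; for a medical chatbot, collect more medical articles."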
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.3"
  },
  "pycharm": {
   "stem_cell": {
    "cell_type": "raw",
    "metadata": {
     "collapsed": false
    },
    "source": []
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
